The Wikipedia Corpus
Authors
Abstract
Wikipedia, the popular online encyclopedia, has in just six years grown from an adjunct to the now-defunct Nupedia to over 31 million pages and 429 million revisions in 256 languages and spawned sister projects such as Wiktionary and Wikisource. Available under the GNU Free Documentation License, it is an extraordinarily large corpus with broad scope and constant updates. Its articles are largely consistent in structure and organized into category hierarchies. However, the wiki method of collaborative editing creates challenges that must be addressed. Wikipedia’s accuracy is frequently questioned, and systemic bias means that quality and coverage are uneven, while even the variety of English dialects juxtaposed can sabotage the unwary with differences in semantics, diction and spelling. This paper examines Wikipedia from a research perspective, providing basic background knowledge and an understanding of its strengths and weaknesses. We also address a technical challenge posed by the sheer volume of text made available (1.04 TB for the English version): a simple, easily implemented dictionary compression algorithm permits time-efficient random access to the data while reducing its size twenty-eight-fold.
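The abstract does not describe the compression algorithm itself, so the following is only a minimal sketch of the general idea of dictionary coding with random access, not the authors' method. It assumes whitespace-delimited tokens, a shared dictionary built over the whole collection, fixed 4-byte token ids, and an offset table with one entry per document; the function names are hypothetical.

```python
import struct
from collections import Counter


def build_dictionary(docs):
    """Map every distinct token to an integer id, most frequent first."""
    counts = Counter(tok for doc in docs for tok in doc.split(" "))
    vocab = [tok for tok, _ in counts.most_common()]
    return vocab, {tok: i for i, tok in enumerate(vocab)}


def compress(docs, token_ids):
    """Encode each document independently and record where it starts."""
    blob = bytearray()
    offsets = [0]
    for doc in docs:
        ids = [token_ids[tok] for tok in doc.split(" ")]
        # Fixed-width ids keep the sketch simple (an assumption, not the paper's format).
        blob += struct.pack(f"<{len(ids)}I", *ids)
        offsets.append(len(blob))
    return bytes(blob), offsets


def read_document(blob, offsets, vocab, index):
    """Decode one document using only its own byte range."""
    start, end = offsets[index], offsets[index + 1]
    count = (end - start) // 4
    ids = struct.unpack(f"<{count}I", blob[start:end])
    return " ".join(vocab[i] for i in ids)


if __name__ == "__main__":
    docs = ["the cat sat", "the dog sat on the mat"]
    vocab, token_ids = build_dictionary(docs)
    blob, offsets = compress(docs, token_ids)
    print(read_document(blob, offsets, vocab, 1))
```

Because each document is encoded against the shared dictionary without reference to its neighbours, any single document can be decoded from its byte range alone. Fixed 4-byte ids save little space on short tokens; a practical scheme would likely assign shorter codes to frequent tokens.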
Similar resources
Wikipedia Mining: Wikipedia as a Corpus for Knowledge Extraction
Wikipedia, a collaborative Wiki-based encyclopedia, has become a huge phenomenon among Internet users. It covers a huge number of concepts of various fields such as Arts, Geography, History, Science, Sports and Games. As a corpus for knowledge extraction, Wikipedia’s impressive characteristics are not limited to the scale, but also include the dense link structure, word sense disambiguation bas...
WikiCoref: An English Coreference-annotated Corpus of Wikipedia Articles
This paper presents WikiCoref, an English corpus annotated for anaphoric relations, where all documents are from the English version of Wikipedia. Our annotation scheme follows that of OntoNotes with a few differences. We annotated each markable with coreference type, mention type and the equivalent Freebase topic. Since most similar annotation efforts concentrate on very specific types of w...
Between Comparable and Parallel: English-Czech Corpus from Wikipedia
We describe the process of creating a parallel corpus from Czech and English Wikipedias using methods which are language independent. The corpus consists of Czech and English Wikipedia articles, the Czech ones being translations of the English ones; it is aligned at the sentence level and is accessible in the Sketch Engine corpus manager.
Automatically Linking GermaNet to Wikipedia for Harvesting Corpus Examples for GermaNet Senses
The comprehension of a word sense is much easier when its usages are illustrated by example sentences in linguistic contexts. Hence, examples are crucially important to better understand the sense of a word in a dictionary. The goal of this research is the semi-automatic enrichment of senses from the German wordnet GermaNet with corpus examples from the online encyclopedia Wikipedia. The paper ...
Learning Named Entity Recognition from Wikipedia
We present a method to produce free, enormous corpora to train taggers for Named Entity Recognition (NER), the task of identifying and classifying names in text, often solved by statistical learning systems. Our approach utilises the text of Wikipedia, a free online encyclopedia, transforming links between Wikipedia articles into entity annotations. Having derived a baseline corpus, we found th...
Disentangling the Wikipedia Category Graph for Corpus Extraction
In several areas of research such as knowledge management and natural language processing, domain-specific corpora are required for tasks such as terminology extraction and ontology learning. The investigations presented here are based on the assumption that Wikipedia can be used for the purpose of corpus extraction. Wikipedia has the advantage of possessing a semantic layer, which should ease ...
Publication date: 2011